Abstract. This paper addresses the task of detecting intrusions in the
form of malicious attacks on programs running on a host computer system
by inspecting the trace of system calls made by these programs.
We use ‘attack-tree’ type generative models for such intrusions to select
features that are used by a Support Vector Machine classifier. Our
approach combines the ability of an HMM generative model to handle
variable-length strings, i.e. the traces, and the non-asymptotic nature
of Support Vector Machines that permits them to work well with small
training sets.


1  Introduction 

This article concerns the task of monitoring programs and processes running on
computer systems to detect break-ins or misuse. For example, programs like
sendmail and finger on the UNIX operating system run with administrative
privileges and are susceptible to misuse because of design shortcomings.
Any user can pass specially crafted inputs to these programs, effect a
‘buffer overflow’ (or some such exploit), and break into the system. To detect such
attacks, the execution of vulnerable programs should be screened at run-time.
This can be done by observing the trace (the sequence of operating system calls, with
or without argument values) of the program. In [3], S. Hofmeyr et al. describe a
method of learning to discriminate between sequences of system calls (without
argument values) generated by normal use and misuse of processes that run with
(root) privileges. From annotated examples of traces, they compile a list of short
subsequences. A given trace is compared against this list at various positions and
is flagged as anomalous if its similarity to the training traces annotated as normal
falls below a threshold; the similarity measure is based on the extent of partial
matches with the short sequences derived from the training traces. In
[15], A. Wespi et al. use the Teiresias pattern matching algorithm on the traces
in a similar manner to flag anomalous behavior. In both of the above, the
set of subsequences used for comparison has to be learnt from the annotated
set of traces (sequences of system calls) because no other usable information or
formal specification on legal or compromised execution of programs is available.
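To make the subsequence-matching idea concrete, the following is a minimal sketch in Python of a sliding-window detector. It is a simplification and not the exact similarity measure of [3] or [15]: the window length `n`, the exact-match test, and the threshold value are assumptions chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(trace: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield every contiguous window of n system calls in the trace."""
    for i in range(len(trace) - n + 1):
        yield tuple(trace[i:i + n])


def build_normal_db(normal_traces: List[List[str]], n: int) -> Set[Tuple[str, ...]]:
    """Collect all windows seen in traces annotated as normal."""
    db: Set[Tuple[str, ...]] = set()
    for trace in normal_traces:
        db.update(ngrams(trace, n))
    return db


def mismatch_rate(trace: List[str], db: Set[Tuple[str, ...]], n: int) -> float:
    """Fraction of windows of the trace that never occur in the normal database."""
    windows = list(ngrams(trace, n))
    if not windows:
        return 0.0
    misses = sum(1 for w in windows if w not in db)
    return misses / len(windows)


def is_anomalous(trace: List[str], db: Set[Tuple[str, ...]],
                 n: int = 3, threshold: float = 0.2) -> bool:
    """Flag the trace when too many windows are unknown (threshold is illustrative)."""
    return mismatch_rate(trace, db, n) > threshold
```

A low mismatch rate corresponds to high similarity with the normal training traces, so thresholding the mismatch rate from above plays the role of thresholding the similarity from below.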


The  approach  advocated  in  this  article  is  to  obtain  a  compact  representation  of 
program  behavior  and  use  it  (after  some  reduction)  to  select  features  to  be  used 
with  a  Support  Vector  Machine  learning  classifier. 

Let $\mathcal{Y}$ be the set of all possible system calls. A trace $\mathbf{y}$ is then an element of
$\mathcal{Y}^*$, which is the set of all strings composed of elements of $\mathcal{Y}$. For a given program,
let the training set be $\mathcal{T} = \{(\mathbf{y}_i, L_i) \mid i = 1, \ldots, T\}$, where $L_i \in \{0, 1\}$ is the
label corresponding to trace $\mathbf{y}_i$: 0 for normal traces and 1 for attack traces. The
detection problem then is to come up with a rule $\hat{L}$, based on the training set, that
attempts to minimize the probability of misclassification $P_e = \Pr[\hat{L}(\mathbf{y}) \neq L(\mathbf{y})]$.
What is of more interest to system administrators is the trade-off between the
probability of detection $P_D = \Pr[\hat{L}(\mathbf{y}) = 1 \mid L(\mathbf{y}) = 1]$ and the probability of
false alarms $P_{FA} = \Pr[\hat{L}(\mathbf{y}) = 1 \mid L(\mathbf{y}) = 0]$ that the classifier provides. These
probabilities are independent of the probabilities of occurrence of normal and
malicious traces.
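As an illustration, the two conditional probabilities can be estimated from a labelled test set as class-conditional frequencies. The sketch below is in Python; the function name `detection_rates` is hypothetical and not part of the original work.

```python
def detection_rates(true_labels, predicted_labels):
    """Return empirical (P_D, P_FA) for labels in {0, 1} (1 = attack).

    P_D  = fraction of attack traces classified as attacks.
    P_FA = fraction of normal traces classified as attacks.
    Both are conditioned on the true class, so they do not depend on
    how frequent attacks are in the test set.
    """
    attacks = normals = detected = false_alarms = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == 1:
            attacks += 1
            detected += int(guess == 1)
        else:
            normals += 1
            false_alarms += int(guess == 1)
    p_d = detected / attacks if attacks else float("nan")
    p_fa = false_alarms / normals if normals else float("nan")
    return p_d, p_fa
```

Sweeping a decision threshold and recording the resulting pairs gives points on the receiver operating characteristic.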

Annotation (usually manual) of live traces is a difficult and slow procedure.
Attacks are also rare occurrences. Hence, traces corresponding to attacks are
few in number. Likewise, we do not even have a good representative sample of
traces corresponding to normal use. Hence, regardless of the features used, we
need to use non-parametric classifiers that can handle finite (small) training
sets. Support Vector Machine learning carves out a decision rule reflecting the
complicated statistical relationships amongst features from finite training sets
by maximizing true generalization (strictly speaking, a bound on generalization)
instead of just the performance on the training set. To use Support Vector
Machines, we need to map each variable-length trace into a (real-vector-valued)
feature space where kernel functions (Section 4) can be used. This conversion is
performed by parsing the raw traces into shorter strings and extracting models
of program execution from them.


2  MODELS  FOR  ATTACKS 

The  malicious  nature  of  a  program  is  due  to  the  presence  of  a  subsequence,  not 
necessarily  contiguous,  in  its  trace  of  system  calls.  For  the  same  type  of  attack 
on  the  host,  there  are  several  different  combinations  of  system  calls  that  can 
be  used.  Furthermore,  innocuous  system  calls  or  sequences  can  be  injected  into 
various stages of program execution (various segments of the traces). Thus the
intrinsic variety of attack sequences and the padding with harmless calls lead
to a polymorphism of traces for the same plan of attack. Real attacks have a
finite  (and  not  too  long)  underlying  attack  sequence  of  system  calls  because  they 
target  specific  vulnerabilities  of  the  host.  This  and  the  padding  are  represented 
in  a  ‘plan  of  attack’  called  the  Attack  Tree  [12]. 


2.1  Attack  Trees 

An Attack Tree ($\mathcal{A}$) [12] is a directed acyclic graph (DAG) with a set of nodes
and associated sets of system calls used at these nodes. It represents a hierarchy
of pairs of tasks and methods to fulfill those tasks. These nodes and sets of
system calls are of the following three types:

1. $\mathcal{V} = \{v_1, v_2, \ldots, v_{k_1}\}$, the nodes representing the targeting of specific
vulnerabilities in the host system, and a corresponding collection of subsets of
$\mathcal{Y}$: $\mathcal{Y}^{\mathcal{V}} = \{\mathcal{Y}^{\mathcal{V}}_1, \mathcal{Y}^{\mathcal{V}}_2, \ldots, \mathcal{Y}^{\mathcal{V}}_{k_1}\}$ representing the possible system calls that
target those vulnerabilities.

2. $\mathcal{P} = \{p_1, p_2, \ldots, p_{k_2}\}$, the set of instances where padding can be done,
along with a corresponding collection of subsets of $\mathcal{Y} \cup \{\epsilon\}$ ($\epsilon$ is the
null alphabet signifying that no padding system call has been included):
$\mathcal{Y}^{\mathcal{P}} = \{\mathcal{Y}^{\mathcal{P}}_1, \mathcal{Y}^{\mathcal{P}}_2, \ldots, \mathcal{Y}^{\mathcal{P}}_{k_2}\}$.

3. $\mathcal{F} = \{f_1, f_2, \ldots, f_{k_3}\}$, the final states into which the scheme jumps after
completion of the attack plan, along with a collection of subsets of $\mathcal{Y} \cup \{\epsilon\}$:
$\mathcal{Y}^{\mathcal{F}} = \{\mathcal{Y}^{\mathcal{F}}_1, \mathcal{Y}^{\mathcal{F}}_2, \ldots, \mathcal{Y}^{\mathcal{F}}_{k_3}\}$, a set that is not of much interest from the point
of view of detecting attacks.

There may be multiple system calls issued while at a state, with possible
restrictions on the sequence of issue. The basic attack scheme encoded in the Attack
Tree is not changed by modifications such as altering the padding scheme or the
amount of padding (time spent in the padding nodes). Given an attack tree, it
is straightforward to find the list ($\mathcal{L}_{\mathcal{A}} \subset \mathcal{Y}^*$) of all traces that it can generate.
But given a trace, we do not have a scheme to check if it could have been
generated by $\mathcal{A}$ without searching through the list $\mathcal{L}_{\mathcal{A}}$. Our intrusion detection
scheme needs to execute the following steps:

1. Learn about $\mathcal{A}$ from the training set $\mathcal{T}$.

2. Form a rule to determine the likelihood of a given trace being generated by
$\mathcal{A}$.

These  objectives  can  be  met  by  a  probabilistic  modeling  of  the  Attack  Tree. 


2.2  Hidden  Markov  Models  for  Attack  Trees 

Given an Attack Tree $\mathcal{A}$, we can set up an equivalent Hidden Markov model $H^1$
that captures the uncertainties in padding and the polymorphism of attacks. The
state-space of $H^1$, $\mathcal{X}^1 = \{x^1_1, x^1_2, \ldots, x^1_n\}$ (the superscript 1 corresponding to
the attack model (abnormal or malicious program) and the superscript 0
corresponding to the normal program model), is actually the union $\{x^1_1\} \cup \mathcal{V} \cup \mathcal{P} \cup \mathcal{F}$,
with $x^1_1$ being the start state representing the start node with no attack initiated
and $n = 1 + k_1 + k_2 + k_3$. We now need to describe the statistics of state
transitions (with time replacing the position index along a trace) to reflect the edge
structure of $\mathcal{A}$ and to also reflect the duration of stay in the vulnerability and
padding nodes. The only allowed state transitions are the ones already in $\mathcal{A}$ and
self-loops at each of the states. The picture is completed by defining conditional
output probabilities of system calls given the state in a way that captures the
information in $\mathcal{Y}^{\mathcal{V}}$ and $\mathcal{Y}^{\mathcal{P}}$. Thus we have, $\forall x^1_i, x^1_j \in \mathcal{X}^1$, $\forall y_l \in \mathcal{Y} \cup \{\epsilon\}$ and
$\forall t \in \mathbb{N}$:

$$P[X(t+1) = x^1_i \mid X(t) = x^1_j] = q^1_{ji}, \qquad (1)$$

$$P[Y(t+1) = y_l \mid X(t) = x^1_j] = r^1_{jl}. \qquad (2)$$

We can write down a similar HMM for the normal traces also. This normal
HMM, $H^0$, has as its state-space a set $\mathcal{X}^0$ in general bigger than $\mathcal{X}^1$, and
certainly with a different state transition structure and conditional output
probabilities of system calls given the state. The associated probabilities are as follows.
$\forall x^0_i, x^0_j \in \mathcal{X}^0$, $\forall y_l \in \mathcal{Y}$ and $\forall t \in \mathbb{N}$:

$$P[X(t+1) = x^0_i \mid X(t) = x^0_j] = q^0_{ji}, \qquad (3)$$

$$P[Y(t+1) = y_l \mid X(t) = x^0_j] = r^0_{jl}. \qquad (4)$$

We would like to represent the probabilities for the above HMMs as functions of
some vector $\theta$ of real-valued parameters so as to be able to use the framework of
[4] and [5]. In the next section, we use these parametric HMMs to derive a
real-valued feature vector of fixed dimension for these variable-length strings that
will enable us to use Support Vector Machines for classification.

3  REAL  VALUED  FEATURE  VECTORS  FROM 
TRACES 


Since we are dealing with variable-length strings, we would like to extract
features living in a subset of a Euclidean space on which kernel functions are
readily available, enabling use of Support Vector Machines [14] [1]. In [4] and [5],
each observation $\mathbf{y}$ is either the output of a parametric HMM (Correct
Hypothesis $H_1$) or not (Null Hypothesis $H_0$). Then we can compute the Fisher score:

$$U_{\mathbf{y}} = \nabla_\theta \log\left(P[\mathbf{y} \mid H_1, \theta]\right) \qquad (5)$$

as the feature vector corresponding to each $\mathbf{y}$, $\theta$ being the real-vector-valued
parameter. What is not clear in this set-up is how, given only the training set,
the Fisher score is computed. For instance, the $i$th entry of $U_{\mathbf{y}}$ will look like

$$\frac{\partial}{\partial \theta_i} \log\left(P[\mathbf{y} \mid H_1, \theta]\right) = \frac{\frac{\partial}{\partial \theta_i} P[\mathbf{y} \mid H_1, \theta]}{P[\mathbf{y} \mid H_1, \theta]} \qquad (6)$$

This could clearly depend on $\theta$. To use some feature like the Fisher score, we
need to identify a real-valued vector parameter $\theta$ and to completely specify the
computation of $U_{\mathbf{y}}$.

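Let the sets of malicious and normal traces in the training set be:

$$\mathcal{M} = \{\mathbf{y} \mid (\mathbf{y}, L(\mathbf{y})) \in \mathcal{T},\ L(\mathbf{y}) = 1\}$$
$$\mathcal{N} = \{\mathbf{y} \mid (\mathbf{y}, L(\mathbf{y})) \in \mathcal{T},\ L(\mathbf{y}) = 0\}$$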


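Let $n_1, n_0$ be the sizes of the state-spaces of the attack and normal HMMs
respectively. For $H^1$ we compute an estimate of probabilities $\hat{H}^1 = \{\hat{q}^1_{ji}, \hat{r}^1_{jl}\}$,
based on the Expectation Maximization algorithm [10] [2]. We obtain an updated
set of estimates $\tilde{H}^1 = \{\tilde{q}^1_{ji}, \tilde{r}^1_{jl}\}$ that (locally) increases the likelihood of $\mathcal{M}$ (i.e.
of the traces in $\mathcal{M}$) by maximizing the auxiliary function as below:

$$\tilde{H}^1 = \arg\max_{H} \sum_{\mathbf{y} \in \mathcal{M}} E_{\hat{H}^1}\left[\log P(\mathbf{y}; H) \mid \mathbf{y}\right] \qquad (7)$$

This step can be executed by the following (in the same manner as equation (44)
of [10]):

$$\tilde{q}^1_{ji} = \frac{\hat{q}^1_{ji} \sum_{\mathbf{y} \in \mathcal{M}} \frac{\partial P[\mathbf{y} \mid H]}{\partial q_{ji}}\left(\hat{H}^1\right)}{\sum_{k=1}^{n_1} \hat{q}^1_{jk} \sum_{\mathbf{y} \in \mathcal{M}} \frac{\partial P[\mathbf{y} \mid H]}{\partial q_{jk}}\left(\hat{H}^1\right)} \qquad (8)$$

$$\tilde{r}^1_{jl} = \frac{\hat{r}^1_{jl} \sum_{\mathbf{y} \in \mathcal{M}} \frac{\partial P[\mathbf{y} \mid H]}{\partial r_{jl}}\left(\hat{H}^1\right)}{\sum_{k=1}^{|\mathcal{Y}|} \hat{r}^1_{jk} \sum_{\mathbf{y} \in \mathcal{M}} \frac{\partial P[\mathbf{y} \mid H]}{\partial r_{jk}}\left(\hat{H}^1\right)} \qquad (9)$$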


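where for simplicity the null output $\epsilon$ is not considered. A variant of this idea is
a scheme where, instead of the summation over all $\mathbf{y} \in \mathcal{M}$, we repeat the update
separately for each $\mathbf{y} \in \mathcal{M}$ (in some desired order) as follows:

$$\tilde{q}^1_{ji} = \frac{\hat{q}^1_{ji} \frac{\partial P[\mathbf{y} \mid H]}{\partial q_{ji}}\left(\hat{H}^1\right)}{\sum_{k=1}^{n_1} \hat{q}^1_{jk} \frac{\partial P[\mathbf{y} \mid H]}{\partial q_{jk}}\left(\hat{H}^1\right)} \qquad (10)$$

$$\tilde{r}^1_{jl} = \frac{\hat{r}^1_{jl} \frac{\partial P[\mathbf{y} \mid H]}{\partial r_{jl}}\left(\hat{H}^1\right)}{\sum_{k=1}^{|\mathcal{Y}|} \hat{r}^1_{jk} \frac{\partial P[\mathbf{y} \mid H]}{\partial r_{jk}}\left(\hat{H}^1\right)} \qquad (11)$$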


$\hat{H}^1$ is set equal to the update $\tilde{H}^1$ and the above steps are repeated till some
criterion of convergence is met. We will now specify the initial value of $\hat{H}^1$ with
which this recursion gets started. The acyclic nature of the Attack Tree means
that, with an appropriate relabelling of nodes, the state-transition matrix is
upper triangular:

$$q^1_{ji} > 0 \Rightarrow i \geq j$$

or block-upper triangular if some states (padding states for instance) are allowed
to communicate with each other. As initial values for the EM algorithm, we can
take (the equi-probable assignment):

$$\hat{q}^1_{ji} = \frac{1}{n_1 - j + 1}, \quad \forall i \geq j \qquad (12)$$

noting that equation (8) preserves the triangularity. Similarly, we can take:

$$\hat{r}^1_{jl} = \frac{1}{|\mathcal{Y}|}, \quad \forall l, j. \qquad (13)$$
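The equi-probable initialization can be sketched as follows in Python. Indices are 0-based here, so row `j` spreads its mass over the `n - j` states `i >= j`; the uniform output assignment over the alphabet is an assumption of this sketch, and the function names are hypothetical.

```python
def initial_transition_matrix(n_states):
    """Upper-triangular, row-stochastic start for EM.

    q[j][i] approximates P[X(t+1) = x_i | X(t) = x_j]; with 0-based j the
    entry 1 / (n_states - j) matches 1 / (n_1 - j + 1) for 1-based j.
    """
    q = [[0.0] * n_states for _ in range(n_states)]
    for j in range(n_states):
        share = 1.0 / (n_states - j)
        for i in range(j, n_states):
            q[j][i] = share
    return q


def initial_output_matrix(n_states, n_symbols):
    """Assumed uniform start: every system call equally likely in every state."""
    return [[1.0 / n_symbols] * n_symbols for _ in range(n_states)]
```

Because multiplicative re-estimation updates keep zero entries at zero, starting from this matrix keeps the triangular structure through the iterations.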

Since we want to be alert to variations in the attack by padding, it is not a good
idea to start with a more restrictive initial assignment for the conditional output
probabilities unless we have reliable information, such as constraints imposed by
the Operating System, or ‘tips’ of an ‘expert-hacker’. Such system-dependent
restrictions, in the form of constraints on some of the probabilities $q^1_{ji}, r^1_{jl}$,
further focus our attention on the real vulnerabilities in the system. To further
sharpen our attention, we can augment $\mathcal{M}$ by adding to it its traces segmented
by comparing with the traces in $\mathcal{N}$ and using any side information; this is essentially
an attempt at stripping off padding. These segmented traces would be given
smaller weights in the EM recursion (7). Going further in that direction, we can,
instead of using the EM algorithm, use various segmentations of the traces in
$\mathcal{T}$ (into $n_1$ parts) and estimate the probabilities $\{q^1_{ji}, r^1_{jl}\}$. Even though we face
difficulties such as a large number of unknowns, a relatively small training set,
and the problem of settling on a local optimum point in the EM algorithm, we
are banking on the robustness of the Support Vector Machine classifier that uses
the parameters of the generative model. We can compute similar estimates ($\hat{H}^0$)
for the HMM representing the normal programs even though they do not, in
general, admit simplifications like triangularity of the state-transition matrix.


The parameter vector we are interested in is the following:

$$\theta = \left[q_{11}, q_{12}, \ldots, q_{NN}, r_{11}, r_{12}, \ldots, r_{N|\mathcal{Y}|}\right]^T \qquad (14)$$

$N$ being the larger of $n_1, n_0$; setting to zero those probabilities that are not
defined in the smaller model. This vector can be estimated for the two HMMs
$H^1, H^0$: $\hat{\theta}^1, \hat{\theta}^0$ simply from $\hat{H}^1, \hat{H}^0$.

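For any trace, be it from $\mathcal{T}$ or from the testing set, we can define the following
feature vector:

$$U_{\mathbf{y}} = \left[\nabla_\theta \log\left(P[\mathbf{y} \mid H^1, \theta]\right) \big|_{\theta = \hat{\theta}^1}\right] \qquad (15)$$

This vector measures the likelihood of a given trace being the output of the
Attack Tree model and can be the basis of a Signature-based Intrusion Detection
Scheme. On the other hand, we can use the information about normal programs
gathered in $\hat{H}^0$ to come up with

$$U_{\mathbf{y}} = \begin{bmatrix} \nabla_\theta \log\left(P[\mathbf{y} \mid H^1, \theta]\right) \big|_{\theta = \hat{\theta}^1} \\ \nabla_\theta \log\left(P[\mathbf{y} \mid H^0, \theta]\right) \big|_{\theta = \hat{\theta}^0} \end{bmatrix} \qquad (16)$$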


which can be used for a Combined Signature and Anomaly-based detection.
Something to be kept in mind is that the parameter vector (and hence the feature
vectors) defined by (14) will contain many useless entries (with values zero)
because we do not use the triangularity of the state-transition matrix for $H^1$ or
any system-dependent restrictions and because we artificially treat (in (14)) the
HMMs to be of equal size. Instead, we can define different (smaller) parameter
vectors $\theta_{\mathcal{M}}$ and $\theta_{\mathcal{N}}$ for the malicious and normal HMMs respectively and
considerably shrink the feature vectors. Also, for each feature vector in (15) and the
two ‘halves’ of the vector in (16), there is a constant scaling factor in the form
of the reciprocal of the likelihood of the trace given the HMM of interest (as
displayed in equation (6)). This constant scaling tends to be large because of the
smallness of the concerned likelihoods. We can store this likelihood as a separate
entry in the feature vector without any loss of information. A similar issue crops
up in the implementation of the EM algorithm: the forward and backward
probabilities needed for the computations in (8), (9), (10) and (11) tend to become
very small for long observation sequences, making it important to have a high
amount of decimal precision.
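As a rough illustration of how the entries of the Fisher score in (15) that correspond to transition probabilities can be obtained from forward and backward probabilities, consider the Python sketch below. It makes several assumptions that are not in the text: symbols are integer indices, the output at time t is emitted by the state occupied at time t (the usual textbook convention, which differs slightly from the indexing of (1)-(2)), an initial distribution `pi` is given, and the entries of the transition matrix are differentiated as free parameters. No rescaling is applied, so it is only suitable for short traces.

```python
def forward(pi, q, r, y):
    """alpha[t][i] = P[y_0..y_t, X_t = i] for a discrete-output HMM."""
    n = len(pi)
    alpha = [[pi[j] * r[j][y[0]] for j in range(n)]]
    for t in range(1, len(y)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[j] * q[j][i] for j in range(n)) * r[i][y[t]]
            for i in range(n)
        ])
    return alpha


def backward(q, r, y):
    """beta[t][j] = P[y_{t+1}..y_{T-1} | X_t = j]."""
    n = len(q)
    beta = [[1.0] * n for _ in y]
    for t in range(len(y) - 2, -1, -1):
        for j in range(n):
            beta[t][j] = sum(
                q[j][i] * r[i][y[t + 1]] * beta[t + 1][i] for i in range(n)
            )
    return beta


def likelihood(pi, q, r, y):
    """P[y | model], the sum of the final forward probabilities."""
    return sum(forward(pi, q, r, y)[-1])


def fisher_score_transitions(pi, q, r, y):
    """d log P[y] / d q[j][i] for all (j, i), flattened row by row."""
    n = len(pi)
    alpha = forward(pi, q, r, y)
    beta = backward(q, r, y)
    p = sum(alpha[-1])
    if p == 0.0:
        raise ValueError("trace has zero likelihood under the model")
    score = []
    for j in range(n):
        for i in range(n):
            # dP/dq[j][i]: every time step at which the j -> i move can occur.
            dp = sum(
                alpha[t][j] * r[i][y[t + 1]] * beta[t + 1][i]
                for t in range(len(y) - 1)
            )
            score.append(dp / p)  # the 1/P factor is the scaling seen in (6)
    return score
```

The division by `p` is the reciprocal-likelihood factor discussed above; storing `p` separately and keeping `dp` unscaled is one way to realize the suggestion of carrying the likelihood as its own feature entry.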


4  THE  SVM  ALGORITHM  AND  NUMERICAL 
EXPERIMENTS 

Support Vector Machines (SVMs) [14] are non-parametric classifiers designed to
provide good generalization performance even on small training sets. An SVM
maps input (real-valued) feature vectors ($x \in X$ with labels $y \in Y$) into a
(much) higher dimensional feature space ($z \in Z$) through some nonlinear
mapping (something that captures the nonlinearity of the true decision boundary).
In the feature space, we can classify the labelled feature vectors $(z_i, y_i)$ using
hyper-planes:

$$y_i\left[\langle z_i, w \rangle + b\right] \geq 1 \qquad (17)$$

and minimize the functional $\Phi(w) = \frac{1}{2}\langle w, w \rangle$. The solution to this quadratic
program can be obtained from the saddle point of the Lagrangian:

$$L(w, b, \alpha) = \frac{1}{2}\langle w, w \rangle - \sum_i \alpha_i \left(y_i\left[\langle z_i, w \rangle + b\right] - 1\right) \qquad (18)$$

$$w^* = \sum_i y_i \alpha_i^* z_i, \quad \alpha_i^* \geq 0; \qquad (19)$$

Those input feature vectors in the training set that have positive $\alpha_i^*$ are called
Support Vectors, $S = \{z_i \mid \alpha_i^* > 0\}$, and because of the Karush-Kuhn-Tucker
optimality conditions, the optimal weight can be expressed in terms of the Support
Vectors alone:

$$w^* = \sum_{S} y_i \alpha_i^* z_i, \quad \alpha_i^* > 0; \qquad (20)$$

This determination of $w$ fixes the optimal separating hyper-plane. The above
method has the daunting task of transforming all the input raw features $x_i$
into the corresponding $z_i$ and carrying out the computations in the higher
dimensional space $Z$. This can be avoided by finding a symmetric and positive
semi-definite function, called the Kernel function, between pairs of $x_i$:

$$K : X \times X \to \mathbb{R}^+ \cup \{0\}, \quad K(a, b) = K(b, a) \quad \forall a, b \in X \qquad (21)$$

Then, by a theorem of Mercer, a transformation $f : X \to Z$ is induced for which

$$K(a, b) = \langle f(a), f(b) \rangle_Z \quad \forall a, b \in X \qquad (22)$$

Then the above Lagrangian optimization problem gets transformed to the
maximization of the following function of $\alpha_i$:

$$W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (23)$$

$$w^* = \sum_i y_i \alpha_i^* z_i, \quad \alpha_i^* \geq 0; \qquad (24)$$

the support vectors being the ones corresponding to the positive $\alpha$s. The set
of hyper-planes considered in the higher dimensional space $Z$ has a small
estimated VC dimension [14]. That is the main reason for the good generalization
performance of SVMs.
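Once the multipliers are known, classifying a new vector needs only kernel evaluations against the support vectors, since the inner product of the weight with a mapped point expands into a kernel sum. A minimal Python sketch follows; it assumes labels in {-1, +1} (attack and normal labels 1 and 0 would be mapped to +1 and -1), and that `alphas` and `b` come from some quadratic programming solver not shown here.

```python
def svm_decision(x, support_vectors, labels, alphas, b, kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors.

    Uses w* = sum_i y_i alpha_i z_i, so that <w*, f(x)> becomes a sum of
    kernel evaluations and the map f never has to be computed explicitly.
    """
    total = b
    for sv, y, a in zip(support_vectors, labels, alphas):
        total += a * y * kernel(sv, x)
    return 1 if total >= 0 else -1


def linear_kernel(u, v):
    """Plain inner product, used here only to exercise svm_decision."""
    return sum(a * c for a, c in zip(u, v))
```

For example, with support vectors (1, 1) and (-1, -1), labels +1 and -1, and both multipliers equal to 0.25, the implied weight is (0.5, 0.5) and each support vector sits exactly on its margin.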

Now that we have real vectors for each trace, we are at full liberty to use the
standard kernels of SVM classification. Let $u_1, u_2 \in \mathbb{R}^n$. We have the Gaussian
Kernel

$$K(u_1, u_2) = \exp\left(-\frac{1}{2\sigma^2}(u_1 - u_2)^T(u_1 - u_2)\right) \qquad (25)$$

the Polynomial Kernel

$$K(u_1, u_2) = (u_1^T u_2 + c_1)^d + c_2, \quad c_1, c_2 \geq 0, \quad d \in \mathbb{N} \qquad (26)$$

or the Fisher Kernel

$$K(u_1, u_2) = u_1^T I^{-1} u_2; \quad I = E_Y\left[U_Y U_Y^T\right] \qquad (27)$$
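The three kernels can be written down directly for plain Python lists. This is a sketch under stated assumptions: the information matrix is estimated as an empirical average of outer products of score vectors, and its inverse `info_inv` is assumed to be supplied by the caller (no matrix inversion is implemented here). Function names are hypothetical.

```python
import math


def gaussian_kernel(u1, u2, sigma=1.0):
    """exp(-||u1 - u2||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u1, u2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def polynomial_kernel(u1, u2, c1=1.0, c2=0.0, d=2):
    """(u1 . u2 + c1)^d + c2 with c1, c2 >= 0 and d a positive integer."""
    dot = sum(a * b for a, b in zip(u1, u2))
    return (dot + c1) ** d + c2


def empirical_information(scores):
    """Average outer product of score vectors, an estimate of I = E[U U^T]."""
    m = len(scores)
    n = len(scores[0])
    return [
        [sum(u[a] * u[b] for u in scores) / m for b in range(n)]
        for a in range(n)
    ]


def fisher_kernel(u1, u2, info_inv):
    """u1^T I^{-1} u2, with the inverse information matrix passed in."""
    return sum(
        u1[a] * info_inv[a][b] * u2[b]
        for a in range(len(u1))
        for b in range(len(u2))
    )
```

With `info_inv` set to the identity matrix, the Fisher kernel reduces to the ordinary inner product of the two score vectors.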

Having described the various components of our scheme for intrusion detection
and classification, we provide below a description of the overall scheme and
experiments aimed to provide results on its performance. The overall detection
scheme executes the following steps:

1. For the given $T_1$ attack traces of system calls $\mathbf{y}_i$, we estimate using the EM
algorithm an HMM model $H^1$ for an attack with $n_1$ states.

2. For the given $T_0$ normal traces of system calls $\mathbf{y}_i$, we estimate an HMM model
$H^0$ for the normal situation with $n_0$ states.

3. We compute the Fisher scores for either a signature-based intrusion
detection or a combined signature and anomaly-based intrusion detection using
equations (15) and (16).

4. Using the Fisher scores we train an SVM employing any one of the kernels
(Gaussian, Polynomial, Fisher).

5. Given a test trace of system calls $\mathbf{y}$, we compute the Fisher scores of $\mathbf{y}$,
enter them in the SVM classifier, and let the classifier decide whether the
decision should be 1 (attack) or 0 (normal).

We performed numerical experiments on a subset of the data-set for host based
intrusion detection from the University of New Mexico [13] [3]. We need to
distinguish between normal and compromised execution on the Linux Operating
system of the lpr program, which is vulnerable because it runs as a
privileged process. In the experiments, we tried various kernels in the SVMs. The
performance evaluation is based on the computation of several points of the
receiver operating characteristic (ROC) curve of the overall classifier, i.e. the plot
of the curve for the values of the probability of correct classification (detection)
$P_D$ vs the false alarm probability $P_{FA}$.

In our experiments with HMMs (both attack and normal), we encountered
difficulties due to the finite precision of computer arithmetic (the long
double data type of C/C++, for instance, is not adequate):

1. The larger the assumed number of states for the HMM, the smaller the
values of the probabilities $\{q_{ji}\}$. For a fixed set of traces, like in our case,
increasing the number of states from, say, 5 to 10 or any higher value did not
affect the EM estimation (or the computation of the Fisher score) because,
despite the attacks and normal executions being carried out in more than 5
(or $n$) stages, the smaller values of $\{q_{ji}\}$ make the EM algorithm stagnate
immediately at a local optimum.

2. Having long traces (200 is a nominal value for the length in our case) means
that values of the forward and backward probabilities [10] $\alpha_t(j), \beta_t(j)$ become
negligible in the EM algorithm as well as in the computation of the Fisher
score. For the EM algorithm, this means being stagnant at a local optimum
and for the computation of the Fisher score, it means obtaining score vectors
all of whose entries are zero.

3. While computing the Fisher scores (15), (16), if any element of $\theta$ is very small
at the point of evaluation, the increased length of the overall Fisher score has
a distorting effect on the SVM learning algorithm. For instance, while using
linear kernels, the set of candidate separating hyper-planes in the feature
space is directly constrained by this. This problem is actually the result of
including the statistics of non-specific characteristics (background noise, so
to speak) like the transition and conditional output probabilities related to
the basic system calls like break, exit, uname etc.

To  combat  these  problems  of  numerical  precision,  one  can  go  for  an  enhanced 
representation  of  small  floating  point  numbers  by  careful  book-keeping.  But  this 
comes  at  the  cost  of  a  steep  increase  in  the  complexity  of  the  overall  detection 
system  and  the  time  taken  for  computations. 
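One standard form of such book-keeping is to rescale the forward probabilities at every time step and accumulate the logarithms of the scale factors. The Python sketch below illustrates the effect on a toy model; it is not the solution adopted in this paper, and the conventions (integer symbol indices, initial distribution `pi`, output emitted by the current state) are assumptions of the sketch.

```python
import math


def likelihood_naive(pi, q, r, y):
    """Unscaled forward recursion; underflows to 0.0 on long traces."""
    n = len(pi)
    alpha = [pi[j] * r[j][y[0]] for j in range(n)]
    for t in range(1, len(y)):
        alpha = [
            sum(alpha[j] * q[j][i] for j in range(n)) * r[i][y[t]]
            for i in range(n)
        ]
    return sum(alpha)


def log_likelihood_scaled(pi, q, r, y):
    """Forward recursion with per-step normalization.

    After each step the forward vector is divided by its sum c_t, and
    log P[y] is recovered as the sum of log c_t.
    """
    n = len(pi)
    alpha = [pi[j] * r[j][y[0]] for j in range(n)]
    log_p = 0.0
    for t in range(len(y)):
        if t > 0:
            alpha = [
                sum(alpha[j] * q[j][i] for j in range(n)) * r[i][y[t]]
                for i in range(n)
            ]
        c = sum(alpha)
        if c == 0.0:
            return float("-inf")  # the trace is impossible under the model
        log_p += math.log(c)
        alpha = [a / c for a in alpha]
    return log_p
```

On a trace of 4000 symbols under a two-state model with all probabilities equal to 0.5, the naive likelihood evaluates to exactly 0.0 in double precision, while the scaled version returns the finite value 4000 log 0.5.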

We  propose  a  solution  that  simplifies  the  observations  and  segments  each 
trace  into  small  chunks  with  the  idea  of  viewing  the  trace  as  a  (short)  string  of 
these  chunks.  This  solution  removes  the  floating  point  precision  problems. 

4.1  SVM  classification  using  reduced  HMMs 

We describe a technique for deriving a reduced order Attack HMM (or a normal
HMM) from the traces in the training set. We choose a small number of states
to account for the most characteristic behavior of attacks (or of normal program
execution). We also use the observation that system calls that constitute
intrusions (attack system calls from the set $\mathcal{Y}^{\mathcal{V}}$) are not exactly used for padding
(i.e. $\mathcal{Y}^{\mathcal{V}} \cap \mathcal{Y}^{\mathcal{P}} \approx \emptyset$). For every trace $\mathbf{y}$, we can compute the ratio of the number
of occurrences of a system call $s$ and the length of that trace. Call this number
$\rho_s(\mathbf{y})$. We can also compute the ratio of the position of first occurrence of a
system call $s$ and the length of the trace (same as the ratio of the length of
the longest prefix of $\mathbf{y}$ not containing $s$ and the length of $\mathbf{y}$). Call this number
$\delta_s(\mathbf{y})$. Calculate these ratios $\rho_s(\mathbf{y}), \delta_s(\mathbf{y})$ for all system calls $s \in \mathcal{Y}$, and for all
$T_1$ malicious traces in $\mathcal{T}$.

For every $s \in \mathcal{Y}$, find the median of $\rho_s(\mathbf{y})$ over all $T_1$ malicious traces in $\mathcal{T}$.
Call it $\rho^1_s$. Similarly, compute the medians $\delta^1_s$, $\forall s \in \mathcal{Y}$. We prefer the median over
the mean or the mode because we want to avoid being swayed by outliers. We now
propose a scheme for identifying attack states $\{v\}$. Choose $\gamma_1, \gamma_2 : 0 < \gamma_1, \gamma_2 < 1$.
Find subsets $\{s_1, s_2, \ldots, s_k\}$ of $\mathcal{Y}$ such that

$$|\rho^1_{s_i} - \rho^1_{s_j}| < \gamma_1, \quad |\delta^1_{s_i} - \delta^1_{s_j}| < \gamma_2, \quad \forall i, j \in \{1, 2, \ldots, k\} \qquad (28)$$

Increase or decrease $\gamma_1, \gamma_2$ so that we are left with a number of subsets equal
to the desired number of states $n_1$. In practice, most, if not all, of these subsets
are disjoint. These subsets form the attack states. However, the alphabet is
no longer $\mathcal{Y}$ but $\mathcal{Y}^*$. Thus, for the state $x_j = \{s_1, s_2, \ldots, s_k\}$, all strings of
the form ‘$w_1, s_{\pi(1)}, w_2, s_{\pi(2)}, w_3, \ldots, w_k, s_{\pi(k)}, w_{k+1}$’ are treated as the same
symbol corresponding to it (with $w_1, w_2, w_3, \ldots, w_k, w_{k+1} \in \mathcal{Y}^*$ and with $\pi$ a
permutation on $\{1, 2, \ldots, k\}$ such that $\delta^1_{s_{\pi(i)}}$ is non-decreasing with $i$). We call
this symbol (also a regular expression) $y_j$.
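The ratios, their medians, and one possible way of forming subsets that satisfy the pairwise condition (28) can be sketched in Python as below. The greedy grouping is an assumption of this sketch (the text does not prescribe a search procedure), traces are assumed to be non-empty lists of system-call names, and a call that never occurs in a trace is given a first-occurrence ratio of 1.

```python
from statistics import median


def occurrence_ratio(trace, s):
    """Number of occurrences of call s divided by the trace length."""
    return trace.count(s) / len(trace)


def first_position_ratio(trace, s):
    """Length of the longest prefix without s, divided by the trace length."""
    prefix_len = trace.index(s) if s in trace else len(trace)
    return prefix_len / len(trace)


def median_profile(traces, alphabet):
    """Per-call medians of both ratios over a set of traces."""
    rho = {s: median(occurrence_ratio(y, s) for y in traces) for s in alphabet}
    delta = {s: median(first_position_ratio(y, s) for y in traces) for s in alphabet}
    return rho, delta


def group_calls(rho, delta, gamma1, gamma2):
    """Greedy grouping: a call joins the first group in which it is within
    gamma1 (occurrence) and gamma2 (position) of every existing member."""
    groups = []
    for s in sorted(rho, key=lambda c: (delta[c], c)):
        for g in groups:
            if all(abs(rho[s] - rho[m]) < gamma1 and
                   abs(delta[s] - delta[m]) < gamma2 for m in g):
                g.append(s)
                break
        else:
            groups.append([s])
    return groups
```

Calls are visited in order of their median first-occurrence ratio, so each group lists its members in the order in which they would appear in the compound symbol; tightening or loosening `gamma1` and `gamma2` changes the number of groups.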


Fig. 2. Plots of the values of $\rho^1$ and $\delta^1$ over normal (dashed lines) and attack (solid
lines) sequences used in obtaining reduced HMMs for the lpr (normal) and lprcp
(attack) programs (the system call index has been renamed to ignore those system
calls never used).

Fig. 3. Plot of ROC (for different number of hidden states) using SVMs and reduced
HMMs using the computation of the medians $\rho^1, \rho^0, \delta^1$ and $\delta^0$. We used the trace dataset
of lpr (normal) and lprcp (attack) programs.


Fig.  4.  Plots  of  the  values  of  and  p°  over  normal  (dashed  lines)  and  attack  (solid 
lines)  sequences  used  in  obtaining  reduced  HMMs  for  the  lpr  (normal)  and  lprcp 
(attack) programs (For the Eject program: sys_call_12 = pipe, sys_call_13 = fork. For
the Ps program: sys_call_10 = fork, sys_call_11 = fcntl).


Now, we can assign numerical values for $\{q_{ji}\}$ and for $\{r_{jl}\}$. The transition
probability matrix will be given a special structure. Its diagonal has entries of
the form $\tau_i$, the first super-diagonal has its entries equal to $1 - \tau_i$, and all
other entries of the matrix are equal to 0. This is the same as using a flat or
left-right HMM [10]. We set the conditional output probability of observing the
compound output $y_j$ corresponding to state $x_j$ to be $\mu_j : 0 < \mu_j < 1$. We treat all
other outputs at this state as the same, and this wild-card symbol (representing
$\mathcal{Y}^* - \{y_j\}$) gets the probability $1 - \mu_j$. We can make the values $\mu_j$ all the same or
different but parameterized in some way, along with the $\tau_i$s, by a single variable
so that we can easily experiment with detection performance as a function of
the $\mu_j, \tau_i$. A point to be kept in mind all along is that we need to parse any
given trace $\mathbf{y}$ into $n_1$ (or more) contiguous segments. When there are different
segmentations possible, all of them can be constructed and the corresponding
feature vectors tested by the classifier.
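The special structure of the reduced model can be sketched as follows in Python. One detail is an assumption of the sketch rather than something stated in the text: the last state has no super-diagonal neighbour, so it is made absorbing to keep every row summing to one. Function names are hypothetical.

```python
def left_right_transition_matrix(taus):
    """Diagonal tau_j, first super-diagonal 1 - tau_j, zeros elsewhere.

    Assumption: the final state is absorbing (self-loop probability 1),
    because it has no successor to receive the mass 1 - tau.
    """
    n = len(taus)
    q = [[0.0] * n for _ in range(n)]
    for j, tau in enumerate(taus):
        if j + 1 < n:
            q[j][j] = tau
            q[j][j + 1] = 1.0 - tau
        else:
            q[j][j] = 1.0
    return q


def reduced_output_probs(mus):
    """Per state: (P[compound symbol y_j], P[wild-card symbol])."""
    return [(mu, 1.0 - mu) for mu in mus]
```

Tying all `taus` and `mus` to a single variable, as suggested above, then amounts to calling these two functions with lists built from that one value.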

The above steps can be duplicated for constructing the normal HMM also.
A sharper and more compact representation is obtained if the Attack tree and
the Normal tree do not share common subsets as states. In particular, consider a
subset (of $\mathcal{Y}$) $x = \{s_1, s_2, \ldots, s_k\}$ that meets condition (28) for both the normal
and attack traces:

$$|\rho^\ell_{s_i} - \rho^\ell_{s_j}| < \gamma_{1\ell}, \quad |\delta^\ell_{s_i} - \delta^\ell_{s_j}| < \gamma_{2\ell},$$

$$\forall i, j \in \{1, 2, \ldots, k\}, \quad 0 < \gamma_{1\ell}, \gamma_{2\ell} < 1, \quad \ell \in \{0, 1\} \qquad (29)$$

Then, $x$ should clearly not be a state in either the Attack HMM or the Normal
HMM. The signature-based detection scheme would as usual use only the reduced
attack HMM. The combined signature and anomaly-based approach would use
both the attack and normal HMMs.

Now  the  overall  detection  scheme  executes  the  following  steps: 


1. For the given $T_1$ attack traces of system calls $\mathbf{y}_i$, we parse each $\mathbf{y}_i$ into $n_1$
blocks and estimate the reduced HMM model $H^1$ for an attack with
$n_1$ states.

2. For the given $T_0$ normal traces of system calls $\mathbf{y}_i$, we parse each $\mathbf{y}_i$ into $n_0$ blocks
and estimate a reduced HMM model $H^0$ for the normal situation with $n_0$
states.

3. We compute the Fisher scores for either a signature-based intrusion
detection or a combined signature and anomaly-based intrusion detection using
equations (15) and (16).

4. Using the Fisher scores we train an SVM employing any one of the kernels
(Gaussian, Polynomial, Fisher).

5. Given a test trace of system calls $\mathbf{y}$, we compute the Fisher scores of $\mathbf{y}$,
enter them in the SVM classifier, and let the classifier decide whether the
decision should be 1 (attack) or 0 (normal).

We performed numerical experiments on the live Lpr and Lprcp (the attacked
version of Lpr) traces in the data-set for host-based intrusion detection [13] [3]. We
found that the quadratic programming step of the SVM learning algorithm did
not converge when we used linear and polynomial kernels (because of the very long
feature vectors). On the other hand, SVM learning was instantaneous when we
used the Gaussian kernel on the same set of traces, and the value of the parameter $\sigma$
in equation (25) made no significant difference. We therefore used the Gaussian kernel (25).
We selected a small training set (about one percent of the whole set of traces,
with the same ratio of intrusions as in the whole set). We trained the SVM
with different trade-offs between the training error and the margin (through the
parameter c in [7]) and different numbers of hidden states for the Attack and
Normal HMMs. We averaged the resulting $P_d$, $P_{fa}$ (on the whole set) over
different random choices of the training set.
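The evaluation protocol above can be sketched as follows, assuming that $P_d$ denotes the fraction of attack traces flagged as attacks and $P_{fa}$ the fraction of normal traces flagged as attacks. The helper names `stratified_sample` and `detection_rates`, the fixed random seed, and the toy label counts are hypothetical and not taken from the experiments.

```python
import random

def stratified_sample(labels, fraction, rng):
    """Pick training indices that keep roughly the attack/normal ratio.

    At least one trace is drawn from every non-empty class.
    """
    chosen = []
    for cls in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        if not idx:
            continue
        k = max(1, round(fraction * len(idx)))
        chosen.extend(rng.sample(idx, k))
    return sorted(chosen)

def detection_rates(labels, predictions):
    """Return (P_d, P_fa) for 0/1 labels and 0/1 predictions."""
    attacks = sum(1 for lab in labels if lab == 1)
    normals = len(labels) - attacks
    hits = sum(1 for lab, p in zip(labels, predictions) if lab == 1 and p == 1)
    false_alarms = sum(1 for lab, p in zip(labels, predictions) if lab == 0 and p == 1)
    p_d = hits / attacks if attacks else float("nan")
    p_fa = false_alarms / normals if normals else float("nan")
    return p_d, p_fa

rng = random.Random(0)            # fixed seed only to make the sketch repeatable
labels = [1] * 10 + [0] * 990     # toy set: 10 attack traces, 990 normal traces
train_idx = stratified_sample(labels, 0.01, rng)
```

Repeating the draw with different seeds, training on `train_idx`, and averaging the resulting pairs from `detection_rates` over the whole set mirrors the averaging described above.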

We also performed experiments on the eject and ps attacks in the 1999 MIT-
LL-DARPA data set [8]. We used traces from the first three weeks of training.
In the case of the eject program attack, we had a total of 8 normal traces and
3 attack traces in the BSM audit records for the first three weeks. Needless to
say, the SVM classifier made no errors at any size of the reduced HMMs. The
interesting fact to observe was that the single compound symbol (28) (for the
most reduced HMM) ‘pipe*fork’ was enough to classify correctly, thus learning
the Buffer-overflow step from only the names of the system calls in the traces.
The ps trace-set can be said to have more statistical significance. We had 168
normal and 3 attack instances. However, for all sizes of reduced HMMs, all of
the Fisher scores for the Attack traces were the same as for the Normal ones.
Here, too, at all resolutions, the buffer-overflow step was learnt cleanly: all the
reduced HMMs picked the symbol ‘fork*fcntl’ to be part of their symbol set
(28). Here too, the SVM made no errors at all. The plots of $p^1_s$ and $p^0_s$ in Fig. 3
complete the picture. This data-set leads us to believe that this approach
efficiently learns buffer-overflow type attacks. It also highlights the problem of a lack
of varied training instances.


We used the SVM$^{light}$ [7] program for Support Vector Learning authored by
Thorsten Joachims.


5  SVM  classification  using  gappy-bigram  count  feature 
vectors 


Here, we present an algorithm that uses a simpler feature that avoids the
estimation of the gradient of the likelihoods. For any trace $y \in \mathcal{Y}^*$, we can write
down a vector of the number of occurrences of the so-called gappy-bigrams in it.
A bigram is a string (for our purposes, over the alphabet $\mathcal{Y}$) of length two that is
specified by its two elements in order. A gappy-bigram ‘$s \lambda s'$’ is any finite-length
string (over the set $\mathcal{Y}$) that begins with the symbol $s$ and terminates with the
symbol $s'$. Let

$\#_{ss'}(y)$ = the number of occurrences of the gappy-bigram ‘$s \lambda s'$’ in $y$    (30)

where

$s, s' \in \mathcal{Y}$, $\lambda \in \mathcal{Y}^* \cup \{\epsilon\}$, $\epsilon$ being the null string.    (31)

We write down the $T^2$-long vector of counts $\#_{ss'}(y)$ for all $(s, s') \in \mathcal{Y} \times \mathcal{Y}$:

$$C(y) = \begin{pmatrix} \#_{s_1 s_1}(y) \\ \#_{s_1 s_2}(y) \\ \vdots \\ \#_{s_T s_T}(y) \end{pmatrix} \qquad (32)$$


We  call  the  feature  vector  C(y),  the  count  score  of  y  and  use  this  to  modify  the 
earlier  scheme  using  the  Fisher  score. 
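The count score can be sketched in a few lines. The sketch assumes that an occurrence of a gappy-bigram in $y$ means a pair of positions i < j holding the first and the second symbol, with any (possibly empty) gap in between; the function names, the toy trace and the three-symbol alphabet are hypothetical.

```python
from collections import Counter

def gappy_bigram_counts(trace):
    """Count, for each ordered symbol pair (a, b), the position pairs i < j
    with trace[i] == a and trace[j] == b (the gap between them may be empty)."""
    counts = Counter()
    seen = Counter()  # how many times each symbol occurred before this position
    for symbol in trace:
        for earlier, n in seen.items():
            counts[(earlier, symbol)] += n
        seen[symbol] += 1
    return counts

def count_score(trace, alphabet):
    """Flatten the counts into a vector of length len(alphabet) ** 2,
    ordered like the entries of equation (32)."""
    counts = gappy_bigram_counts(trace)
    return [counts[(a, b)] for a in alphabet for b in alphabet]

# Toy trace over a hypothetical three-call alphabet.
trace = ["open", "read", "open", "close"]
alphabet = ["open", "read", "close"]
vec = count_score(trace, alphabet)
```

Under this reading the entries of the vector sum to n(n-1)/2 for a trace of length n, since every pair of positions contributes to exactly one count. The single pass costs time proportional to the trace length times the number of distinct symbols, with no likelihood gradients involved.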

The  new  overall  detection  scheme  executes  the  following  steps: 

1. We compute the count scores using equation (32).

2. Using the count scores we train an SVM employing one of the kernels
(Gaussian, Polynomial, Fisher).

3. Given a test trace of system calls $y$, we compute the count scores of $y$ and
enter them in the SVM classifier, which decides whether the output should
be 1 (attack) or 0 (normal).

We performed numerical experiments on the live Lpr and Lprcp (the attacked version
of Lpr) traces in the data-set for host-based intrusion detection [13] [3]. We found
that the quadratic programming step of the SVM learning algorithm did not
converge when we used linear and polynomial kernels (because of the very long
feature vectors). On the other hand, SVM learning was instantaneous when we
used the Gaussian kernel on the same set of traces. The value of the parameter
$\sigma$ in equation (25) made no significant difference. Our experiments were of the
following two types:


1. We selected a small training set (about one percent of the whole set of traces,
with the same ratio of intrusions as in the whole set). We trained the SVM
with different trade-offs between the training error and the margin (through
the parameter c in [7]). We averaged the resulting $P_d$, $P_{fa}$ (on the whole
set) over different random choices of the training set. Our average (as
well as the median) values of $P_d$, $P_{fa}$ were 0.95 and 0.0.

2. We used the whole set of traces available for training the SVM with different
trade-offs (again, the parameter c in [7]) and used the leave-one-out cross-
validation $\xi\alpha$ estimate ([7]) of $P_d$, $P_{fa}$. We obtained the following values
for $P_d$, $P_{fa}$: 0.992, 0.0.
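For comparison with the second experiment, an explicit leave-one-out loop can be sketched as below. This is not the $\xi\alpha$ estimate of [7], which avoids retraining; it is the brute-force procedure that such an estimate approximates. The nearest-centroid trainer is a hypothetical stand-in for the SVM, and the feature values are made up.

```python
def leave_one_out_rates(features, labels, train):
    """Hold out each trace in turn, retrain, and accumulate (P_d, P_fa).

    train(X, y) must return a function mapping one feature vector to 0 or 1.
    Every fold must still contain at least one trace of each class.
    """
    hits = false_alarms = 0
    for i in range(len(features)):
        X = features[:i] + features[i + 1:]
        y = labels[:i] + labels[i + 1:]
        guess = train(X, y)(features[i])
        if labels[i] == 1 and guess == 1:
            hits += 1
        elif labels[i] == 0 and guess == 1:
            false_alarms += 1
    attacks = sum(labels)
    normals = len(labels) - attacks
    return hits / attacks, false_alarms / normals

def train_nearest_centroid(X, y):
    """Hypothetical stand-in classifier (not an SVM): nearest class centroid."""
    def centroid(cls):
        rows = [x for x, lab in zip(X, y) if lab == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def predict(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1 if d1 < d0 else 0
    return predict

# Toy feature vectors: three normal traces (label 0) and two attacks (label 1).
features = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.0], [5.1, 4.9]]
labels = [0, 0, 0, 1, 1]
p_d, p_fa = leave_one_out_rates(features, labels, train_nearest_centroid)
```

The explicit loop needs one retraining per trace, which is why an estimate computed from a single training run is attractive when the whole set of traces is used.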

We  have  only  one  measured  point  on  the  ROC  curve.  We  also  note  that  this 
detection  system  behaves  like  an  anomaly-based  intrusion  detection  system. 

6  Conclusions

We have described a method for incorporating the structured nature of attacks,
as well as any specific system-dependent or other ‘expert-hacker’ information,
in the HMM generative model for malicious programs. Using the generative
model, we have captured the variability of attacks and compressed the set of
variables to be examined for flagging attacks into a vector of real values. We
use these derived feature vectors in place of variable-length strings as inputs
to the Support Vector Machine learning classifier, which is designed to work
well with small training sets. We have presented a method for deriving reduced
HMMs using the temporal correlations (28, 29) between system calls in traces.
An alternative large-scale HMM classifier would need to use techniques from the
area of large vocabulary speech recognition [6] to grapple with the numerical
problems associated with full-scale generative models for attacks and normal
program execution. We also presented the gappy-bigram count feature vector
for SVM-based classification. We still need to develop versions of the above intrusion
detection systems that work in real time, and versions that work on distributed
programs such as network transactions.